Abstract

This note introduces the method of cross-conformal prediction, which 
is a hybrid of the methods of inductive conformal prediction and cross- 
validation, and studies its validity and predictive efficiency empirically. 

1 Introduction 

The method of conformal prediction produces set predictions that are automat- 
ically valid in the sense that their unconditional coverage probability is equal to 
or exceeds a preset confidence level ([14], Chapter 2). A more computationally 
efficient method of this kind is that of inductive conformal prediction ( [12] , [14] , 
Section 4.1, pQ). However, inductive conformal predictors are typically less pre- 
dictively efficient, in the sense of producing larger prediction sets as compared 
with conformal predictors. Motivated by the method of cross-validation ([11], [15]),
this note explores a hybrid method, which we call cross-conformal prediction. 

We are mainly interested in the problems of classification and regression, in 
which we are given a training set consisting of examples, each example consisting 
of an object and a label, and asked to predict the label of a new test object; in 
the problem of classification labels are elements of a given finite set, and in the
problem of regression labels are real numbers. If we are asked to predict labels
for more than one test object, the same prediction procedure can be applied
to each test object separately. In this introductory section and in our empirical 
studies we consider the problem of binary classification, in which labels can 
take only two values, which we will encode as 0 and 1. We always assume that
the examples (both the training examples and the test examples, consisting 
of given objects and unknown labels) are generated independently from the 
same probability measure; this assumption will be called the assumption of 
randomness. 

The idea of conformal prediction is to try the two different labels, 0 and
1, for the test object, and for either postulated label to test the assumption
of randomness by checking how well the test example conforms to the training
set; the output of the procedure is the corresponding p-values $p^0$ and $p^1$. Two
standard ways to package the pair $(p^0, p^1)$ are:

• Report the confidence $1 - \min(p^0, p^1)$ and credibility $\max(p^0, p^1)$.

• For a given significance level $\epsilon \in (0, 1)$ output the corresponding prediction
set $\{y \mid p^y > \epsilon\}$.

In inductive conformal prediction the training set is split into two parts, the
proper training set and the calibration set. The two p-values $p^0$ and $p^1$ are
computed by checking how well the test example conforms to the calibration
set. The way of checking conformity is based on a prediction rule found from the
proper training set and produces, for each example in the calibration set and for
the test example, the corresponding "conformity score". The conformity score
of the test example is then calibrated to the conformity scores of the calibration
set to obtain the p-value. For details, see Section 2.

Inductive conformal predictors are usually much more computationally ef- 
ficient than the corresponding conformal predictors. However, they are less 
predictively efficient: they use only the proper training set when developing 
the prediction rule and only the calibration set when calibrating the conformity 
score of the test example, whereas conformal predictors use the full training set 
for both purposes. 

Cross-conformal prediction modifies inductive conformal prediction in order 
to use the full training set for calibration and significant parts of the training set 
(such as 80% or 90%) for developing prediction rules. The training set is split 
into K folds of equal (or almost equal) size. For each $k = 1, \ldots, K$ we construct
a separate inductive conformal predictor using the kth fold as the calibration
set and the rest of the training set as the proper training set. Let $(p_k^0, p_k^1)$ be the
corresponding p-values. Next the two sets of p-values, $p_k^0$ and $p_k^1$, are merged
into combined p-values $p^0$ and $p^1$, which are the result of the procedure.

In Appendix A we consider the most standard way of combining p-values,
Fisher's method. However, the method produces badly miscalibrated results as
it assumes the independence of the p-values being combined, whereas in fact
these p-values are heavily dependent. In the main part of the note, namely in
Section 3, we essentially combine p-values by averaging them. This leads to
much better calibration; since we have no theoretical results about the validity
of cross-conformal prediction in this note, we rely on empirical studies involving
the standard Spambase data set. Finally, we use the same data set to demonstrate
the efficiency of cross-conformal predictors as compared with inductive
conformal predictors. Section 4 states an open problem.

2 Inductive conformal predictors 

We fix two measurable spaces: $X$, called the object space, and $Y$, called the
label space. The Cartesian product $Z := X \times Y$ is the example space. A training
set is a sequence $(z_1, \ldots, z_l) \in Z^l$ of examples $z_i = (x_i, y_i)$, where $x_i \in X$ are
the objects and $y_i \in Y$ are the labels. For $S \subseteq \{1, \ldots, l\}$, we let $z_S$ stand for
the sequence $(z_{s_1}, \ldots, z_{s_n})$, where $s_1, \ldots, s_n$ is the sequence of all elements of $S$
listed in the increasing order (so that $n := |S|$).

In the method of inductive conformal prediction, we split the training set
into two non-empty parts, the proper training set $z_T$ and the calibration set $z_C$,
where $(T, C)$ is a partition of $\{1, \ldots, l\}$. An inductive conformity measure is
a measurable function $A : Z^* \times Z \to \mathbb{R}$ (we are interested in the case where
$A(\zeta, z)$ does not depend on the order of the elements of $\zeta \in Z^*$). The idea
behind the conformity score $A(z_T, z)$ is that it should measure how well the
example $z$ conforms to the proper training set $z_T$. A standard choice is

$$A(z_T, (x, y)) := \Delta(y, f(x)), \tag{1}$$

where $f : X \to Y'$ is a prediction rule found from $z_T$ as the training set and
$\Delta : Y \times Y' \to \mathbb{R}$ is a measure of similarity between a label and a prediction.
Allowing $Y'$ to be different from $Y$ (usually $Y' \supset Y$) may be useful when
the underlying prediction method gives additional information to the predicted
label; e.g., the MART procedure used in Section 3 and Appendix A gives the
logit of the predicted probability that the label is 1.

The inductive conformal predictor (ICP) corresponding to $A$ is defined as
the set predictor

$$\Gamma^\epsilon(z_1, \ldots, z_l, x) := \{y \mid p^y > \epsilon\}, \tag{2}$$

where $\epsilon \in (0, 1)$ is the chosen significance level ($1 - \epsilon$ is known as the confidence
level), the p-values $p^y$, $y \in Y$, are defined by

$$p^y := \frac{|\{i \in C \mid \alpha_i \le \alpha^y\}| + 1}{|C| + 1}$$

and

$$\alpha_i := A(z_T, z_i), \quad i \in C, \qquad \alpha^y := A(z_T, (x, y)) \tag{3}$$

are the conformity scores. Given the training set and a test object $x$ the ICP
predicts its label $y$; it makes an error if $y \notin \Gamma^\epsilon(z_1, \ldots, z_l, x)$.
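
The definitions (2) and (3) can be illustrated by a minimal Python sketch. It
assumes the conformity scores have already been computed by some inductive
conformity measure; the function and argument names (icp_p_values,
test_score_for_label, prediction_set) are chosen here for illustration and are
not taken from the programs used in the experiments.

```python
from typing import Callable, Dict, Hashable, Sequence, Set


def icp_p_values(
    calibration_scores: Sequence[float],
    test_score_for_label: Callable[[Hashable], float],
    labels: Sequence[Hashable],
) -> Dict[Hashable, float]:
    """Return the p-value p^y for every candidate label y.

    calibration_scores   -- alpha_i for i in C (scores of the calibration examples)
    test_score_for_label -- maps y to alpha^y, the score of the test example (x, y)
    """
    n_cal = len(calibration_scores)
    p_values = {}
    for y in labels:
        alpha_y = test_score_for_label(y)
        # Number of calibration examples that conform no better than (x, y).
        rank = sum(1 for a in calibration_scores if a <= alpha_y)
        p_values[y] = (rank + 1) / (n_cal + 1)
    return p_values


def prediction_set(p_values: Dict[Hashable, float], epsilon: float) -> Set[Hashable]:
    """Gamma^epsilon: the labels whose p-value exceeds the significance level."""
    return {y for y, p in p_values.items() if p > epsilon}


# Toy usage with made-up scores (hypothetical numbers, for illustration only).
cal = [-1.0, 0.5, 2.0, 3.0]
toy_scores = {0: -2.0, 1: 2.5}
p = icp_p_values(cal, lambda y: toy_scores[y], [0, 1])
```

With these toy numbers no calibration score is below the score of label 0 and
three are below the score of label 1, so the two p-values are 1/5 and 4/5.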

The random variables whose realizations are $x_i$, $y_i$, $z_i$, $x$, $y$, $z$ will be denoted
by the corresponding upper case letters ($X_i$, $Y_i$, $Z_i$, $X$, $Y$, $Z$, respectively). The
following proposition of validity is almost obvious.

Proposition 1 ([2], Proposition 4.1). If random examples $Z_1, \ldots, Z_l$, $Z = (X, Y)$
are i.i.d., the probability of error $Y \notin \Gamma^\epsilon(Z_1, \ldots, Z_l, X)$ does not exceed
$\epsilon$ for any $\epsilon$ and any inductive conformal predictor $\Gamma$.

The family of prediction sets $\Gamma^\epsilon(z_1, \ldots, z_l, x)$, $\epsilon \in (0, 1)$, is just one possible
way of packaging the p-values $p^y$. Another way, already discussed in Section 1
in the context of binary classification, is as the confidence $1 - p$, where $p$ is
the second largest p-value among $p^y$, and the credibility $\max_y p^y$. In the case
of binary classification confidence and credibility carry the same information as
the full set $\{p^y \mid y \in Y\}$ of p-values, but this is not true in general.
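
As a small illustration of this second way of packaging, the following Python
sketch computes confidence and credibility from a dictionary of p-values. The
function name and the dictionary representation are assumptions made here for
the example.

```python
def confidence_and_credibility(p_values):
    """Return (confidence, credibility) for a dict mapping labels to p-values.

    Confidence is one minus the second largest p-value; credibility is the
    largest p-value. At least two labels are required.
    """
    ordered = sorted(p_values.values(), reverse=True)
    credibility = ordered[0]
    confidence = 1.0 - ordered[1]
    return confidence, credibility


# Hypothetical p-values for a binary problem.
conf, cred = confidence_and_credibility({0: 0.02, 1: 0.6})
```
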

In our experiments reported in the next section we split the training set
into the proper training set and the calibration set in proportion 2:1. This 
is the most standard proportion (cf. [9], p. 222, where the validation set plays 
a similar role to our calibration set), but the ideal proportion depends on the 
learning curve for the given problem of prediction (cf. [3], Figure 7.8). Too 
small a calibration set leads to a high variance of confidence (since calibrating 
conformity scores becomes unreliable) and too small a proper training set leads 
to a downward bias in confidence (conformity scores based on a small proper 
training set cannot produce confident predictions). In the next section we will 
see that using cross-conformal predictors improves both bias and variance (cf.
Table 1).



3 Cross-conformal predictors 

Cross-conformal predictors (CCP) are defined as follows. The training set is
split into $K$ non-empty subsets (folds) $z_{S_k}$, $k = 1, \ldots, K$, where $K \in \{2, 3, \ldots\}$
is a parameter of the algorithm and $(S_1, \ldots, S_K)$ is a partition of $\{1, \ldots, l\}$. For
each $k \in \{1, \ldots, K\}$ and each potential label $y \in Y$ of $x$ find the conformity
scores of the examples in $z_{S_k}$ and of $(x, y)$ by

$$\alpha_{i,k} := A(z_{S_{-k}}, z_i), \quad i \in S_k, \qquad \alpha_k^y := A(z_{S_{-k}}, (x, y)), \tag{4}$$

where $S_{-k} := \cup_{j \ne k} S_j$ and $A$ is a given inductive conformity measure. The
corresponding p-values are defined by

$$p^y := \frac{\sum_{k=1}^K |\{i \in S_k \mid \alpha_{i,k} \le \alpha_k^y\}| + 1}{l + 1}. \tag{5}$$

Confidence and credibility are now defined as before; the set predictor $\Gamma^\epsilon$ is also
defined as before, by (2), where $\epsilon > 0$ is another parameter.

The definition of CCPs parallels that of ICPs, except that we now use the
whole training set for calibration. The conformity scores (4) are computed as in
(3) but using the union of all the folds except for the current one as the proper
training set. Calibration (5) is done by combining the ranks of the test example
$(x, y)$ with a postulated label in all the folds.
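
The procedure can be sketched in Python as follows. This is only an
illustration under stated assumptions: fit_scorer stands for any routine that
learns a conformity scorer from a proper training set, make_folds is a simple
random partition, and nearest_mean_scorer is a toy conformity measure for
one-dimensional objects. All three names are hypothetical; the experiments in
this note use MART in R instead.

```python
import random
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

Example = Tuple[float, Hashable]             # (object x, label y)
Scorer = Callable[[float, Hashable], float]  # (x, y) -> conformity score


def make_folds(n: int, K: int, seed: int = 0) -> List[List[int]]:
    """Randomly partition the indices 0..n-1 into K folds of almost equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[k::K] for k in range(K)]


def nearest_mean_scorer(proper: Sequence[Example]) -> Scorer:
    """Toy conformity measure: minus the distance from x to the mean of class y."""
    sums: Dict[Hashable, float] = {}
    counts: Dict[Hashable, int] = {}
    for x, y in proper:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    means = {y: sums[y] / counts[y] for y in sums}
    return lambda x, y: -abs(x - means[y])


def ccp_p_values(
    train: Sequence[Example],
    x_test: float,
    labels: Sequence[Hashable],
    fit_scorer: Callable[[Sequence[Example]], Scorer],
    K: int = 5,
    seed: int = 0,
) -> Dict[Hashable, float]:
    """Cross-conformal p-values: ranks are pooled over the K folds."""
    folds = make_folds(len(train), K, seed)
    counts = {y: 0 for y in labels}
    for fold in folds:
        in_fold = set(fold)
        # Proper training set: the union of all folds except the current one.
        proper = [train[i] for i in range(len(train)) if i not in in_fold]
        score = fit_scorer(proper)
        cal_scores = [score(*train[i]) for i in fold]   # alpha_{i,k}
        for y in labels:
            alpha_y = score(x_test, y)                  # alpha_k^y
            counts[y] += sum(1 for a in cal_scores if a <= alpha_y)
    return {y: (counts[y] + 1) / (len(train) + 1) for y in labels}


# Toy usage: two well-separated classes on the real line.
toy_train = [(0.1 * i, 0) for i in range(10)] + [(10.0 + 0.1 * i, 1) for i in range(10)]
toy_p = ccp_p_values(toy_train, 0.45, [0, 1], nearest_mean_scorer, K=5)
```

In the toy usage the test object lies inside class 0, so the p-value of label 1
takes its smallest possible value, 1/(l + 1) with l = 20.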

If we define the separate p-value

$$p_k^y := \frac{|\{i \in S_k \mid \alpha_{i,k} \le \alpha_k^y\}| + 1}{|S_k| + 1} \tag{6}$$

for each fold, we can see that $p^y$ is essentially the average of $p_k^y$. In particular,
if each fold has the same size, $|S_1| = \cdots = |S_K|$, a simple calculation gives

$$p^y = \bar{p}^y + \frac{K - 1}{l + 1} (\bar{p}^y - 1) \approx \bar{p}^y, \tag{7}$$

where $\bar{p}^y := \frac{1}{K} \sum_{k=1}^K p_k^y$ is the arithmetic mean of $p_k^y$ and the $\approx$ assumes $K \ll l$.
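
The calculation behind (7) can be spelled out as follows. The symbols $m := l/K$
(the common fold size) and $c_k$ (the count in the numerator of (6)) are
introduced here only for this derivation.

```latex
\begin{align*}
c_k &:= |\{ i \in S_k \mid \alpha_{i,k} \le \alpha^y_k \}| = (m+1)\,p^y_k - 1, \\
\sum_{k=1}^K c_k &= (m+1)\,K\,\bar{p}^y - K = (l+K)\,\bar{p}^y - K, \\
p^y &= \frac{\sum_{k=1}^K c_k + 1}{l+1}
     = \frac{(l+1)\,\bar{p}^y + (K-1)\,\bar{p}^y - (K-1)}{l+1}
     = \bar{p}^y + \frac{K-1}{l+1}\,(\bar{p}^y - 1).
\end{align*}
```

Since $0 \le 1 - \bar{p}^y \le 1$, the correction term is at most $(K-1)/(l+1)$
in absolute value, which is negligible when $K \ll l$.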

In this note we give calibration plots for 5-fold and 10-fold cross-conformal
prediction. We take $K \in \{5, 10\}$ following the advice in [5] (who refer to
Breiman and Spector [3] and Kohavi [10]). In our experiments we use the
popular Spambase data set. The size of the data set is 4601, and there are two
labels: spam, encoded as 1, and email, encoded as 0.

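We consider the conformity measure (1) where $f$ is output by MART ([9],
Chapter 10) and the measure of similarity $\Delta$ is

$$\Delta(y, f(x)) := \begin{cases} f(x) & \text{if } y = 1, \\ -f(x) & \text{if } y = 0. \end{cases} \tag{8}$$

MART's output $f(x)$ models the log-odds of spam vs email,

$$f(x) = \log \frac{P(1 \mid x)}{P(0 \mid x)},$$

which makes the interpretation of (8) as conformity score very natural. (MART
is known [9] to give good results on the Spambase data set.)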

The R programs used in the experiments described in this section and the 
appendix have been uploaded to arXiv. The programs use the gbm package 
with virtually all parameters set to the default values (given in the description 
provided in response to help("gbm")). The only parameter that has been modified
is n.trees, the number of trees, which should be as large as possible and
whose default value was clearly insufficient.

Figure 1 gives the calibration plots for the CCP and for 8 random splits of
the data set into a training set of size 3600 and a test set of size 1001 and of the
training set into 5 or 10 folds. There is a further source of randomness as the
MART procedure is itself randomized. The functions plotted in Figure 1 map
each significance level $\epsilon$ to the percentage of erroneous predictions made by the
set predictor $\Gamma^\epsilon$ on the test set. Visually, the plots are well-calibrated (close to
the bisector of the first quadrant).
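
The points of such a calibration plot can be computed as in the following
Python sketch. It assumes that, for every test example, the p-value of its true
label is available; the set predictor errs on that example exactly when this
p-value does not exceed the significance level. The function names are
illustrative, and the error rate is returned as a fraction (multiply by 100 for
a percentage).

```python
def error_rate(true_label_p_values, epsilon):
    """Fraction of test examples whose true label falls outside the prediction set.

    The true label is excluded exactly when its p-value is at most epsilon.
    """
    errors = sum(1 for p in true_label_p_values if p <= epsilon)
    return errors / len(true_label_p_values)


def calibration_curve(true_label_p_values, grid_size=100):
    """Return (epsilon, error rate) pairs on an evenly spaced grid in [0, 1]."""
    return [
        (j / grid_size, error_rate(true_label_p_values, j / grid_size))
        for j in range(grid_size + 1)
    ]


# Hypothetical p-values of the true labels of four test examples.
curve = calibration_curve([0.05, 0.2, 0.5, 0.9])
```

A well-calibrated predictor gives a curve close to the diagonal, that is, an
error rate close to the significance level at every grid point.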

As for the efficiency of the CCP, see Table 1. The biggest advantage of the
CCP is in the stability of its confidence values: the standard deviation of the 
mean confidences is much less than that for the ICP. However, the CCP also 
gives higher confidence; to some degree this can be seen from the table, but the 
high variance of the ICP confidence masks it: e.g., for the first 100 seeds the 
average of the mean confidence for ICP is 99.16% (with the standard deviation of 
the mean confidences equal to 0.149%, corresponding to the standard deviation 
of 0.015% of the average mean confidence). 
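
The last figure is consistent with the usual standard error of a mean over 100
seeds; the note does not spell out the calculation, so the following is an
assumption about how it was obtained:

```latex
\frac{0.149\%}{\sqrt{100}} = 0.0149\% \approx 0.015\%.
```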

                      Seed 0  Seed 1  Seed 2  Seed 3  Seed 4  Seed 5  Seed 6  Seed 7  Average  St. dev.
mean conf., ICP       99.25%  99.23%  99.00%  99.17%  99.30%  99.12%  99.38%  99.25%  99.21%   0.109%
mean cred., ICP       51.31%  50.37%  49.93%  52.45%  48.98%  50.34%  50.18%  52.00%  50.69%   1.074%
mean conf., K = 5     99.22%  99.17%  99.17%  99.24%  99.27%  99.27%  99.30%  99.30%  99.24%   0.050%
mean cred., K = 5     51.06%  49.70%  50.26%  50.63%  49.81%  49.42%  50.88%  51.40%  50.39%   0.664%
mean conf., K = 10    99.24%  99.20%  99.20%  99.23%  99.26%  99.28%  99.34%  99.32%  99.26%   0.048%
mean cred., K = 10    51.02%  49.69%  50.23%  50.71%  49.70%  49.42%  50.89%  51.39%  50.38%   0.678%

Table 1: Mean (over the test set) confidence and credibility for the ICP and the
5-fold and 10-fold CCP. The results are given for various values of the seed for
the R pseudorandom number generator; column "Average" gives the average
of all the 8 values for the seeds 0-7, and column "St. dev." gives the standard
deviation of those 8 values.

4 Conclusion

Conformal prediction and inductive conformal prediction are two approaches to
the theory of tolerance regions (see, e.g., [8]). The known validity results for
conformal and inductive conformal predictors can be expressed by saying that
they are $1 - \epsilon$ expectation tolerance regions, where $\epsilon$ is the significance level (see
Proposition 1 above for the case of ICPs). It is also known ([2], Proposition 2a)
that inductive conformal predictors are $1 - \delta$ tolerance regions for a proportion
$1 - \epsilon$ for suitable $\delta$ and $\epsilon$. On the other hand, at this time there are no theoretical
results about the validity of cross-conformal predictors, and it is an interesting
open problem to establish such results.

A An approach based on Fisher's method

In this appendix we will briefly discuss an approach to cross-conformal prediction
leading to miscalibrated set predictions.

Fisher's method [3] of combining p-values $p_1, \ldots, p_K$, valid when the $K$ p-values
are independent, combines them into one statistic $-2 \sum_{k=1}^K \ln p_k$ having
the chi-squared distribution with $2K$ degrees of freedom. The corresponding
p-value will be denoted $F(p_1, \ldots, p_K)$:

$$F(p_1, \ldots, p_K) := \mathbb{P}\left(\chi^2 > -2 \sum_{k=1}^K \ln p_k\right), \tag{9}$$

where $\chi^2$ is a random variable having the chi-squared distribution with $2K$
degrees of freedom.
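
Fisher's combined p-value can be computed with the Python standard library
alone, as in the sketch below. It relies on the closed form of the chi-squared
tail for an even number of degrees of freedom,
$\mathbb{P}(\chi^2_{2K} > x) = e^{-x/2} \sum_{j=0}^{K-1} (x/2)^j / j!$; the function
name fisher_combine is chosen here for illustration.

```python
import math


def fisher_combine(p_values):
    """Fisher's combined p-value F(p_1, ..., p_K) for p-values in (0, 1]."""
    K = len(p_values)
    # half_stat is half of Fisher's statistic -2 * sum(ln p_k).
    half_stat = -sum(math.log(p) for p in p_values)
    # Chi-squared tail with 2K degrees of freedom:
    # exp(-x/2) * sum_{j<K} (x/2)^j / j!, evaluated at x/2 = half_stat.
    term, total = 1.0, 1.0
    for j in range(1, K):
        term *= half_stat / j
        total += term
    return math.exp(-half_stat) * total


# Five identical (perfectly dependent) p-values of 0.05.
combined = fisher_combine([0.05] * 5)
```

In the usage line the five p-values carry no more evidence than a single one,
yet the combined value falls far below 0.05, which illustrates how dependence
between the p-values can invalidate the method.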

Naive cross-conformal predictors are defined as follows. The training set is
split into $K$ subsets, as in the case of CCPs. For each $k \in \{1, \ldots, K\}$ find the
p-values $p_k^y$ via (6). Define $p^y := F(p_1^y, \ldots, p_K^y)$, and then define confidence,
credibility, and set predictors $\Gamma^\epsilon$ as before. In other words, naive cross-conformal
predictors are defined in the same way as cross-conformal predictors except that
the function $F$ is defined by (9) rather than by the expression following the =
in (7).

Figure 2 is the analogue of Figure 1 for naive cross-conformal predictors. It is
obvious that the set predictions are badly miscalibrated; the p-values computed
from different folds are heavily dependent. This may appear counterintuitive,
but the reader should remember that we are dealing with a somewhat unusual
kind of hypothesis testing in this note (and in the theory of tolerance regions in
general): instead of testing some properties of the data-generating distribution
we are testing hypotheses about data.

We do not give the efficiency results (such as those given in Table 1) for the
naive CCP since efficiency without validity is meaningless. 

Figure 1: Left panels: the calibration plot for the cross-conformal predictor
with $K = 5$ (top) and $K = 10$ (bottom) folds and the first 8 seeds, 0-7, for the
R pseudorandom number generator. Right panels: the lower left corner of the
corresponding left panel (which is the most important part of the calibration
plot in applications).

Figure 2: The analogue of Figure 1 for the naive cross-conformal predictor.